Overview
As researchers, we often collect data from two or more groups.
Group allocations are categorical variables. They are stored in a special way by R which makes it easier to display them on graphs. In this workshop we learn to how to use categorical variables in boxplots, to colour scatterplots, and to make tables of descriptive statistics.
Making boxplots
- Boxplots are useful for comparing categorised data
- Use
aes()withggplot()to choose a categorical column as the x axis geom_boxplot()draws boxes and whiskers for each category- The box describes the interquartile range
- The midpoint is the median
- Individual points show outliers in the data
iris %>%
ggplot(aes(x=Species, y=Petal.Length)) + # specify dataset columns
geom_boxplot() # add the boxesBoxplots are useful for visualising differences between categories or groups.
Using ggplot(), making a boxplot is similar to a scatterplot.
If we haven’t already done it, we’d need to load the tidyverse:
library(tidyverse)To make the plot, we first choose the data we want to plot and add a pipe to the end of the line to send this data to the next command:
iris %>%Next we use ggplot() and aes(x = ..., y = ...) to select the columns we want to use in the data. aes is short for aesthetics which means ‘something you can see’. So the aes() function defines what will be able to see on the plot — the axes, and other features like color.
This time we have chosen Species for the x axis, because it is a categorical variable.
iris %>%
ggplot(aes(x = Species, y = Petal.Length))[DEMONSTRATE USING AUTO COMPLETE IN RSTUDIO WHEN WRITING THE CODE ABOVE]
You’ll notice that I used the RStudio autocomplete feature to write function and column names from our dataset.
This makes things much easier — especially for column names. If you type column names by hand it’s easy to make spelling errors or typos, and this leads to errors in R.
If we run the code so far, ggplot() draws the plot axes, but doesn’t add the data yet:
[RUN THE CODE SO FAR]
So far we have said what we want to show on our plot, but not how it should be shown.
To finish, we add geom_boxplot() which actually draws the boxes:
iris %>%
ggplot(aes(x = Species, y = Petal.Length)) +
geom_boxplot()Note that I used a + symbol to add the boxplot layer to our graph.
So, now we have a boxplot, with one box drawn for each value of Species.
Interpreting boxplots
In a boxplot:
the thick line is the median or midpoint of the data
the height of the box indicates the interquartile range (IQR), so this contains 50% of the datapoints. A wider IQR indicates greater variation (spread) in a dataset.
the whiskers vary a little bit, depending on the software you use — but normally show the range/spread of the data (The default in
ggplot()is to show the points that are no more/less than 1.5 times the IQR above/below the top/bottom of the box.)
In this boxplot, any data point outside the range of the whiskers is described as an ‘outlying point’ or ‘outlier’. Each outlier is plotted as a dot.
Exercise 1
- Create a new chunk at the bottom of your worksheet
- Use the
irisdataset. Create a boxplot withSpecieson the x-axis andSepal.Widthon the y-axis (sepals are the leaves that encase an iris flower)
Your boxplot should look like this:
Adding color to a plot
- The points of a scatterplot can be coloured based on a third column
- We can use categorical or continuous data for this
- Add the
colour=column_nameoption toaes()(changecolumn_nameto the name of your column)
# colour each point; different colour for each Species
iris %>%
ggplot(aes(Sepal.Length, Sepal.Width, color = Species)) +
geom_point()As we saw before, scatterplots show the relationship between two variables.
Using the mpg data, we could plot the size of car engines (displ, short for displacement) against their fuel efficiency on the highway in miles per gallon (hwy):
mpg %>%
ggplot(aes(displ, hwy)) +
geom_point()[THIS FIRST PLOT IS NOT SHOWN HERE TO SAVE SPACE, BUT IS SHOWN IN THE VIDEO.]
Colour can be added to distinguish the points — for example, what kind of car each point represents.
So far, you’ve used the aes() function to define which variable is plotted on the x and y axes. The aes() function is short for ‘aesthetics’ (what we can see) and connects columns in our data to visual aspects of a plot, like the x and y axes.
The colour=... option adds to this, and says which column is used to colour the points. In the mpg data, the drv column records if a car is front (f), rear (r) or four (4) wheel drive.
We can write color=drv to tell R to colour each point, depending on which wheels are driven:
# colour each point; different colour for each type of transmission
# front, rear and four wheel drive
mpg %>%
ggplot(aes(displ, hwy, color=drv)) +
geom_point()This is an example of using a categorical variable to colour our points.
Exercise 2
- Create a new chunk at the bottom of your worksheet.
- Create a scatterplot using the
gapminderdataset withgdpPercapon the x-axis,lifeExpon the y-axis, andcontinentin colour. - Run the chunk.
Your plot should look like this:
Using different types of data and visual scales
- Continuous, categorical and text data are all common in psychology
- Internally, R stores columns of data using different data types.
- R uses data-types as a clue to set defaults (e.g. for graphs)
- Sometimes we need to convert between types
- E.g., sometimes categorical data is (wrongly) stored as a numeric column
library(tidyverse)
library(psydata)
# see a list of columns in the dataset, and their types
iris %>% glimpse
# make a scatter plot with two continous axes
# both wt and mpg are numeric variables
# so this works well
fuel %>%
ggplot(aes(weight, mpg)) +
geom_point()
# try and make a boxplot with `am` as the x axis
# but because `cyl` is stored as a numeric variable
# (not categorical) the scale of the x axis is wrong
fuel %>%
ggplot(aes(cyl, mpg)) +
geom_boxplot()
# re-draw the plot, but converting `cyl` to a factor/categorical
# variable first. Now the x axis looks correct
fuel %>%
ggplot(aes(factor(cyl), mpg)) +
geom_boxplot()The video introduces the link different types of data (e.g. continuous, categorical, text), the way R stores them, and the way that ggplot presents them on the scales of a plot.
These might seem like small details — but having some understanding of these details will help later on.
Columns vs variables
The word variable gets used in at least 4 different ways in quantitative research, and this confuses a lot of people. We can’t always avoid this ambiguity because the usages are so common in the field, but it helps to know about them in advance:
variables in a theoretical model. That is, things we think exist, and which cause other things to happen. An example would be “empathy”, or “working memory”.
variables in a study design. For example an experimental group allocation, or an attribute of participants like age or gender.
variables in a dataset. Here we mean a column of numbers in a spreadsheet, where the column has a name. This usage overlaps with the previous two, but doesn’t have to. For example, we might have a column of numbers recording participants’ scores on an empathy questionnaire. But we might also have a column containing reaction times for multiple trials; analysed a certain way these reaction times might tell us something about working memory, but the column of reaction times isn’t the same thing as the variable in our theoretical model.
variables in R: these are a general purpose container which can store anything. Variables can contain columns of numbers, but they can also contain whole datasets, or graphs, or the results of statistical tests. A variable in R often isn’t the same thing as an experimental variable.
In this guide use variable to mean either an experimental/theoretical variable, or an R container. When we mean a column of data in a dataset we will use the word column instead.
Types of variable
If we are thinking about variables in our study designs, the main types are:
- categorical or nominal variables: these are sometimes also called factors in experimental design)
- binary variables: these are either true or false, and are a special kind of categorical data
- interval or continuous variables: e.g. heights, weights, or reaction times
- ordinal variables: e.g. a response to a Likert-style question, on a 1-7 scale
- count variables: how often something has occurred, measured in whole numbers greater than or equal to zero
Types of column
Datasets have multiple columns, each with a unique name.
There are different types of column in R. These data-types are mostly determined by the sort of variable. But there is some overlap, and the same variable could be stored in different types of column. We sometimes also need to convert between data types.
There are three data-types you need to know about in R:
- Factors, which are used for categorical data.
- Text, also called character or string data.
- Numeric columns can called be either integer or double. Double is computer speak for ‘double-precision number’, which just means decimals and really big numbers are allowed
- Logical, which can be used to store values that can only be either True or False.
You can see which data-type is used to store a variable when using the glimpse command you saw in session 1 (e.g. here).
iris %>% glimpse
Rows: 150
Columns: 5
$ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4…
$ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3…
$ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1…
$ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0…
$ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, …In the glimpse output you can see the variable names listed on the left, followed by grey text surrounded by angle brackes, e.g.: <dbl> which is the abbreviated data-type.
In this built-in dataset, most of the data is numeric (dbl), but the Species variable is categorical, and stored as a factor (fct).
It’s possible to convert columns from one type to another, to suit our needs, and we’ll see more of this later.
| Data/column type | Useful for | Abbreviations used/subtypes | Often need to convert from |
|---|---|---|---|
| Numeric | Continuous, ordinal | int, dbl |
Categorical; Text |
| Factor | Categorical, ordinal | fct, ord |
Text |
| Text | Free text, categorical | chr |
Categorical |
| Logical | Binary/boolean | lgl |
Numeric |
Column types and scales on graphs
If we look at the mtcars data we can see that all the columns are stored as numeric data (dbl):
mtcars %>% glimpse()
Rows: 32
Columns: 11
$ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8…
$ cyl <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8…
$ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 1…
$ hp <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 18…
$ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92…
$ wt <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3…
$ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 1…
$ vs <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0…
$ am <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0…
$ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3…
$ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2…This is fine if we want to make a scatter plot (here of ‘miles per gallon’ vs the weight of the car):
mtcars %>%
ggplot(aes(wt, mpg)) +
geom_point() In this plot both the x and y axes are continuous. That is, they are numeric variables, using real numbers.
However, if we want to make a boxplot of the mpg column using cyl (number of cylinders) as the x axis then we have a problem:
fuel %>%
ggplot(aes(cyl, mpg)) +
geom_boxplot()
Warning: Continuous x aesthetic -- did you forget aes(group=...)?We might have expected to see;
- miles per gallon on the y axis
- three separate boxes, one for 4, 6 and 8 cylinders
This doesn’t work as expected though.
R spots that cyl is stored as numeric data, so it creates a continuous scale on the x axis. Only one box is drawn at the midpoint of all the values of cyl; because cyl ranges from 4 to 6 the box appears at 6.
We want to use cyl as a categorical variable, so we should convert it to a factor. Then our plot will work properly.
We can use the command factor(cyl) to tell R that the x-axis is a factor:
mtcars %>%
ggplot(aes(factor(cyl), mpg)) +
geom_boxplot()This gives us the boxplot we were expecting. The only change was to replace cyl with factor(cyl). This tells R to convert the variable cyl to a factor, and ggplot can then set the scale of the x axis correctly.
Exercise 3
Work with a friend: Describe the 4 ways in which quantitative researchers might use the word ‘variable’?
(If you need to look these up from the video or text above then try testing yourself again after completing other exercises).
Exercise 4
Use glimpse to check the data types of the mpg and the diamonds datasets.
- The 4th variable in the
mpgdata is a - The 5th variable in the
diamondsdata is a
Exercise 5
Use the fuel dataset to make a boxplot showing miles per gallon on the y axis, and number of gears on the x axis (gear). Your plot should look like this:
Using group_by() to make a summary table and compare groups
- Datasets often contain categorical variables
- We often want to compare statistics (like averages) between categories
- The
group_byfunction is a quick way to combine filtering and summarising group_bycreates a grouped dataframe- Adding
group_by()to a pipeline runs the subsequent steps once for each group. - The result is always a new dataframe
# check the columns in the funimagery dataset
# results of an RCT of functional imagery training for weight loss
# conducted at Plymouth University
funimagery %>% glimpse
# boxplot to compare the intervention groups
funimagery %>%
ggplot(aes(intervention, weight_lost_end_trt)) +
geom_boxplot()
# calculate mean weight loss in each group in a laborious way
funimagery %>%
filter(intervention=="MI") %>%
summarise(mean(weight_lost_end_trt))
funimagery %>%
filter(intervention=="FIT") %>%
summarise(mean(weight_lost_end_trt))
# use group_by to split the data and summarise each group
funimagery %>%
group_by(intervention) %>%
summarise(mean(weight_lost_end_trt))
# store the result (also a dataframe) in a new variable
average_weight_loss <- funimagery %>%
group_by(intervention) %>%
summarise(mean(weight_lost_end_trt))
# example of grouping by two columns at once
funimagery %>%
group_by(gender, intervention) %>%
summarise(mean(weight_lost_end_trt))
# calculate the mean and SD in one go
funimagery %>%
group_by(intervention) %>%
summarise(
mean(weight_lost_end_trt),
sd(weight_lost_end_trt)
)
# how to give your new summary columns a name
# (this is good practice)
funimagery %>%
group_by(intervention) %>%
summarise(
Mean_weight_lost_end_trt = mean(weight_lost_end_trt),
SD_weight_lost_end_trt = sd(weight_lost_end_trt)
)In this video we’ll use the funimagery dataset (in the psydata) package. This contains data from a randomised controlled trial run in Plymouth (Solbrig et al., 2019).
The study compared two treatments for weight loss: Functional Imagery Training (FIT) and Motivational Interviewing (MI).
We can see the columns in the dataset using glimpse:
funimagery %>% glimpse
Rows: 112
Columns: 8
$ gender <chr> "f", "f", "f", "f", "f", "f", "f", "m", "f", "f",…
$ age <int> 44, 32, 33, 21, 27, 56, 50, 57, 34, 25, 70, 56, 5…
$ kg1 <dbl> 107.8, 107.0, 99.5, 80.0, 81.0, 59.0, 95.0, 90.0,…
$ kg2 <dbl> 106.7, 105.4, 101.0, 79.0, 80.0, 57.0, 92.0, 87.0…
$ kg3 <dbl> 106.0, 105.9, 98.8, 78.0, 80.0, 60.0, 92.0, 87.0,…
$ person <int> 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
$ intervention <fct> MI, MI, MI, MI, MI, MI, MI, MI, MI, MI, MI, MI, M…
$ weight_lost_end_trt <dbl> -1.1, -1.6, 1.5, -1.0, -1.0, -2.0, -3.0, -3.0, 0.…Weight was measured three times, and these measurements are in the columns called kg1, kg2 and kg3. The first was taken at baseline (kg1). The final measurement (kg3) was made 12 months after patients started the internetion. The weight_lost_end_trt column is the difference between kg1 and kg3.
We have already seen a plot like this which can compare scores between categories:
funimagery %>%
ggplot(aes(intervention, weight_lost_end_trt)) +
geom_boxplot()This graph is helpful, and makes the differences really clear. But what if we want the actual numbers in a table, or to report in the text of our paper?
Using filter and summarise
One option would be to filter our data first and then summarise each subset in turn:
funimagery %>%
filter(intervention=="MI") %>%
summarise(mean(weight_lost_end_trt))
mean(weight_lost_end_trt)
1 -1.245283This gives us the mean for the MI group (shown in brown underneath the code). We then have to repeat that step for the FIT group:
funimagery %>%
filter(intervention=="FIT") %>%
summarise(mean(weight_lost_end_trt))
mean(weight_lost_end_trt)
1 -5.091525That would be repetitive though, and would get annoying if there were lots of categories. Imagine doing something similar for each continent in the development data, for example.
Using group_by()
Instead of using filter, we can use group_by to split our dataset into multiple groups, summarising each one separately:
funimagery %>%
group_by(intervention) %>%
summarise(mean(weight_lost_end_trt))
# A tibble: 2 x 2
intervention `mean(weight_lost_end_trt)`
<fct> <dbl>
1 MI -1.25
2 FIT -5.09Using group_by() before summarise() splits up the data depending on which intervention the participant was randomised to. The summarise() function calculates the mean for each group separately.
The result is a new dataset with two columns: The name of the intervention, and the average weight lost. We could save this dataset in a new variable for use later, if we liked:
average_weight_losses <- funimagery %>%
group_by(intervention) %>%
summarise(mean(weight_lost_end_trt))If you run this code, you can now see the new variable in the Environment, so it’s stored for later.
Nested groups
Sometimes we might want to group by more than one variable at once.
Imagine we wanted to see if men and women responded differently to the two treatments?
With group_by this is really easy: just add the name of another column to group on (gender), separated with a comma:
funimagery %>%
group_by(gender, intervention) %>%
summarise(mean(weight_lost_end_trt))
# A tibble: 4 x 3
# Groups: gender [2]
gender intervention `mean(weight_lost_end_trt)`
<chr> <fct> <dbl>
1 f MI -1.2
2 f FIT -5.56
3 m MI -1.37
4 m FIT -3.84It looks like women lost a bit more weight on average, but FIT was better than MI for both men and women.
Multiple statistics
Finally, we can calculate multiple statistics at once for each of the groups:
funimagery %>%
group_by(intervention) %>%
summarise(mean(weight_lost_end_trt), sd(weight_lost_end_trt))
# A tibble: 2 x 3
intervention `mean(weight_lost_end_trt)` `sd(weight_lost_end_trt)`
<fct> <dbl> <dbl>
1 MI -1.25 2.09
2 FIT -5.09 4.18Give your new variables a name
R gives the new columns a name based on the function we use to summarise.
For example, if we use mean on the weight_lost_end_trt variable then the new column is called mean(weight_lost_end_trt) (see above).
When we use summarise, we can give the new column a specific name like this:
funimagery %>%
group_by(intervention) %>%
summarise(Mean_weight_lost_end_trt = mean(weight_lost_end_trt), SD_weight_lost_end_trt = sd(weight_lost_end_trt))
# A tibble: 2 x 3
intervention Mean_weight_lost_end_trt SD_weight_lost_end_trt
<fct> <dbl> <dbl>
1 MI -1.25 2.09
2 FIT -5.09 4.18The new name shouldn’t include spaces or other ‘special’ characters (there are ways of doing that if you really want, but we’ll cover that later).
Exercise 6
Replace the ? in the code below to to calculate the median weight lost for men and women undergoing FIT and MI in the funimagery dataset:
? %>%
group_by(?, ?) %>%
summarise( ?(kg3) )These are the correct numbers:
| Gender | Intervention | Median weight loss |
|---|---|---|
| Female | MI | -1.3 |
| Female | FIT | -4.6 |
| Male | MI | -2.1 |
| Male | FIT | -4.3 |
Exercise 7
Use the built-in iris dataset
Use group_by and summarise to calculate the average Sepal.Length for each Species of flower.
Exercise 8
chickwts contains data for the weights of chicks (in grams) fed on different diets.
glimpse(chickwts)
Rows: 71
Columns: 2
$ weight <dbl> 179, 160, 136, 227, 217, 168, 108, 124, 143, 140, 309, 229, 18…
$ feed <fct> horsebean, horsebean, horsebean, horsebean, horsebean, horsebe…Using group_by and summarise. Calculate the mean and standard deviation chick weights for each type of feed.
The mean weight of chicks fed on linseed was (to 2 decimal places) g.
The standard deviation of chicks fed on sunflower was (to 2 decimal places) g.
Check your knowledge
Write an answer to each of these questions in the Check your knowledge section of your workbook. The answers are revealed in Session 4.
- Which functions are needed to make a boxplot?
- What is the difference between a
dbland afctorord? - Which datatypes are used for continuous variables?
- Give an example of why the difference between
dblandfctmatters when making a plot (include code examples for this if you can) - How can you convert a variable from a
dblto afct? - How could you calculate the mean for one level of a factor?
- How would you calculate the mean for all levels of a factor (e.g. for continent in the
developmentdataset?
Extension exercises
Extension exercise 1
Make a scatterplot of the diamonds data. Show carat on the x-axis, price on the y-axis and the clarity of the diamond in colour. Try to produce your plot before comparing it against the answer using the button below.
Interpret your plot: Which category of diamond clarity has the steepest rise in price as size (carat) increases?
Extension exercise 2
Make a scatterplot of the fuel data. Show mpg on the y-axis, engine size on the x-axis, and use type of transmission (auto/manual) to colour the points. Try to produce your plot before comparing it against the answer using the button below.
Interpret your plot: Which type of car (auto or manal) sees the strongest relationship between engine size and fuel economy?
Extension exercise 3
Use the development data. Make a boxplot showing life expectancy by continent for years greater than 1999. (Hint: use filter(), ggplot() and geom_boxplot().)
The plot should look like this:
Extension exercise 4
This boxplot uses the gapminder dataset to show lifeExp (life expectancy) on the y-axis for each continent on the x-axis.
In a new chunk, write the R code to produce this plot.
Extension exercise 5
Create a boxplot which shows drivetrain on the x-axis and miles per gallon when a car is driven in the city on the y-axis. Your plot should look like this:
Extension exercise 6
Try to recreate the plot below. Remember to use factor to convert the type of the column.
Extra (advanced) techniques
‘Flipping’ axes
One neat trick when using boxplots is to ‘flip’ the axes, so the boxes are horizontal. This can save space and make the graph easier to read when there are many categories.
E.g. compare this:
asia <- development %>%
filter(continent=="Asia")
asia %>%
ggplot(aes(country, life_expectancy)) +
geom_boxplot()With this:
asia %>%
ggplot(aes(country, life_expectancy)) +
geom_boxplot() +
coord_flip()All we had to do was add + coord_flip() to the end of our code. This is short for ‘flip coordinates’ (i.e. swap the x and y axes).
Get things in order
Another useful technique is to sort the boxes by their average, or some other value.
asia %>%
ggplot(aes(fct_reorder(country, life_expectancy, median, .desc=TRUE), life_expectancy)) +
geom_boxplot() +
coord_flip()Here we changed country to read fct_reorder(country, life_expectancy, median). Let’s unpack that. Taking each part in order, it means:
- reorder a factor (i.e. change the ordering of the categories)
- we are going to reorder the
countrycolumn - reorder the categories using values in the
life_expectancycolumn - use the
medianto decide the sorting order .desc=TRUEmeans use a descending sort (i.e. we do a reverse sort on the median)
The result is a plot where all the countries are arranged in descending order of their median life expectancy.
Adding labels
If we want to make the x and y labels on our plot nicer, we can use + xlab("label") and + ylab("label").
For example:
fuel %>%
ggplot(aes(factor(gear), mpg)) +
geom_boxplot() +
xlab("Number of gears") + ylab("Fuel economy (mpg)")(Advanced) Extension Exercise: 1
This plot of the American countries is OK, but it’s hard to read the labels:
Can you convert it to look more like this?
Reflection: Imagine someone using the graph for various different purposes. Is it always helpful to sort the countries in order of life expectancy? How should we decide when and how to sort the data?
References
Solbrig, L., Whalley, B., Kavanagh, D. J., May, J., Parkin, T., Jones, R., & Andrade, J. (2019). Functional imagery training versus motivational interviewing for weight loss: A randomised controlled trial of brief individual interventions for overweight and obesity. International Journal of Obesity, 43(4), 883–894.